Abstract


This research proposes and evaluates techniques for se- 
lecting predicates for conditional program properties — that 
is, implications such as p => q whose consequent must be
true whenever the predicate is true. Conditional properties 
are prevalent in recursive data structures, which behave dif- 
ferently in their base and recursive cases, in programs that 
contain branches, in programs that fail only on some inputs, 
and in many other situations. The experimental context of 
the research is dynamic detection of likely program invari- 
ants, but the ideas are applicable to other domains. 

Trying every possible predicate for conditional proper- 
ties is computationally infeasible and yields too many un- 
desirable properties. This paper compares four policies for 
selecting predicates: procedure return analysis, code con- 
ditionals, clustering, and random selection. It also shows 
how to improve predicates via iterated analysis. An experi- 
mental evaluation demonstrates that the techniques improve 
performance on two tasks: statically proving the absence of 
run-time errors with a theorem-prover, and separating faulty 
from correct executions of erroneous programs. 


1 Introduction 


The goal of program analysis is to determine facts about 
a program. The facts are presented to a user, depended on 
by a transformation, or used to aid another analysis. The 
properties frequently take the form of logical formulae that 
are true at a particular program point or points. 

The usefulness of a program analysis depends on what
properties it can report. A major challenge is expanding the
output grammar of a program analysis without making the
analysis unreasonably more expensive and without degrading
the quality of the output, as measured by human or machine
users of the output.

This paper investigates techniques for expanding the out- 
put grammar of a program analysis to include implications 
of the form a => b. Disjunctions such as a ∨ b are a special
case of implications, since (a => b) ≡ (¬a ∨ b). Our
implementation and experimental evaluation are for a spe- 
cific dynamic program analysis that, given program execu- 
tions, produces likely invariants as output. The base analy- 
sis reports properties such as preconditions, postconditions, 
and object invariants that are unconditionally true over a test 
suite. (Section 2.3 describes the technique.) 


A conditional property is one whose consequent is
not universally true, but is true when the predicate is
true. (Equivalently, the consequent is false only when
the predicate is false.) For instance, the local invariant
over a node n of a sorted binary tree, (n.left.value <
n.value) ∧ (n.right.value > n.value), is true unless one
of n, n.left, or n.right is null. Conditional properties are
particularly important in recursive data structures, where
different properties typically hold in the base case and the
recursive case. The predicates are also useful in other
domains. For instance, it can be challenging to select
predicates for predicate abstraction [BMMR01]. A related
context is determining whether to discard information at join
points in an abstract interpretation such as a dataflow
analysis.

Extending an analysis to check implications is trivial. 
However, it is infeasible for a dynamic analysis to check 
a => b for all properties a and b that the base analysis can
produce. One reason is runtime cost: the change squares 
the number of potential properties that must be checked. A 
more serious objection concerns output accuracy. Checking 
(say) 100 times as many properties is likely to increase the 
number of false positives by a factor of 100. This is accept- 
able only if the number of true positives is also increased 
by a factor of 100, which is unlikely. False positives in- 
clude properties that are true over the inputs but are not true 
in general. In the context of interaction with humans, false 
positives also include true properties that are not useful for 
the user’s current task. 


Since it is infeasible to check a => b for every a and
b, the program analysis must restrict the implications that 
it checks. We propose to do so by restricting what proper- 
ties are used for the predicate a, while permitting b to range 


over all properties reportable by the analysis. We use split- 
ting conditions to partition the data under analysis, and then 
combine separate analysis results to create implications or 
conditional properties. Splitting conditions limit the predi- 
cates that are considered, but predicates that are not splitting 
conditions may still appear. We also present a technique that 
leverages the base analysis to refine imprecise predicates via 
an iterated analysis. 

This paper presents four policies (detailed in Section 3) 
for selecting predicates for implications: procedure return 
analysis; code conditionals; clustering; and random selec- 
tion. The last two, which performed best in our experimen- 
tal evaluation, are dynamic analyses that examine program 
executions rather than program text; the second one is static; 
and the first is a hybrid. Dynamic analyses can produce in- 
formation (predicates) about program behavior that is not 
apparent from the program text — for instance, general alias 
analysis remains beyond the state of the art, but runtime
behavior is easy to observe. Also, the internal structure of the
source code does not affect the dynamic policies. This
independence also enables them to work on programs for which source code
is not available, so long as the underlying program analysis 
does not require source code. 

We evaluated the four policies in two different ways. 
First, we compared the accuracy of the produced proper- 
ties, where accuracy is measured by a program verification 
task (Section 4.1); the policies produced implications that 
reduced human effort by 40%. Second, we determined how 
well each of the policy choices exposes errors (Section 4.2); 
12% of the implications directly reflected differences due to 
faulty behavior, even without foreknowledge of the faults. 

The remainder of this paper is organized as follows. Sec- 
tion 2 proposes mechanisms for detecting and refining im- 
plications. Section 3 describes the four policies that deter- 
mine which implications will be computed, and Section 4 
evaluates them. Section 5 discusses related work, and Sec- 
tion 6 recaps our contributions. 


2 Detecting implications 


Figure 1 shows the mechanism for creation of implications.
Rather than directly testing specific a => b invariants,
the analysis splits the input data into two mutually exclusive,
jointly exhaustive parts based on a client-supplied predicate, which we
call a splitting condition. The splitting conditions are not 
necessarily the same as the implication predicates (see Sec- 
tion 2.1). This paper focuses on automating the selection of 
splitting conditions, which are analogous to the predicates 
of predicate abstraction. 

After the data is split into two parts, the base program 
analysis is performed to detect (non-implication) properties 
in each subset of the data. Finally, implications are gener- 
ated from the separately-computed properties, if possible. 
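
The split-then-detect steps described above can be sketched in a few lines of Python, using the sample data of Figure 1. This is only an illustrative sketch, not Daikon's implementation: the `TEMPLATES` table, the `detect` and `split` helpers, and the particular property names are assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

Sample = Tuple[float, float]  # one (x, y) observation

# Hypothetical property templates; a real detector instantiates many more.
TEMPLATES: Dict[str, Callable[[Sample], bool]] = {
    "x even": lambda s: s[0] % 2 == 0,
    "x odd": lambda s: s[0] % 2 == 1,
    "x >= 0": lambda s: s[0] >= 0,
    "x < 0": lambda s: s[0] < 0,
    "y = 2^x": lambda s: s[1] == 2 ** s[0],
    "y > 0": lambda s: s[1] > 0,
}


def detect(samples: List[Sample]) -> Set[str]:
    """Return the names of the templates that hold on every sample."""
    if not samples:
        return set()  # no evidence, so report nothing
    return {name for name, holds in TEMPLATES.items()
            if all(holds(s) for s in samples)}


def split(samples: List[Sample], condition: Callable[[Sample], bool]):
    """Partition the samples by a splitting condition into (yes, no)."""
    yes = [s for s in samples if condition(s)]
    no = [s for s in samples if not condition(s)]
    return yes, no


# The six samples shown in Figure 1.
DATA: List[Sample] = [(-9, 3), (2, 4), (0, 1), (-1, 0.5), (-7, 8), (4, 16)]
yes_part, no_part = split(DATA, TEMPLATES["x even"])

print(sorted(detect(DATA)))      # properties over all of the data
print(sorted(detect(yes_part)))  # properties where the condition holds
print(sorted(detect(no_part)))   # properties where it does not
```

Only `y > 0` survives over the whole data set, while each subset yields additional properties; those extra properties are the raw material for the implications.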


    Input data:
        x=-9, y=3     x= 2, y=4     x= 0, y=1
        x=-1, y=.5    x=-7, y=8     x= 4, y=16

    1. Split the data into parts (splitting condition: x even?)

        yes                   no
        x=2, y=4              x=-9, y=3
        x=0, y=1              x=-1, y=.5
        x=4, y=16             x=-7, y=8

    2. Compute properties over each subset of data

        x even                x odd
        x ≥ 0                 x < 0
        y = 2^x               y > 0
        y > 0

    3. Compare results, produce implications

        x even <=> x ≥ 0
        x even => y = 2^x
        x ≥ 0 => y = 2^x


Figure 1. Mechanism for creation of implications. In the figure, 
the analysis is a dynamic one that operates over program traces. 
Figure 2 gives the details of the third step. 


// S1 and S2 are sets of properties resulting from
// analysis of partitions of the data.
procedure CREATE-IMPLICATIONS(S1, S2)
  for all p1 ∈ S1 do
    if ∃ p2 ∈ S2 such that p1 => ¬p2 and p2 => ¬p1 then
      // p1 and p2 are mutually exclusive
      for all p' ∈ (S1 − S2 − {p1}) do
        output “p1 => p'”


Figure 2. Pseudocode for creation of implications from properties 
over partitions of the data. (In our experiments, the underlying 
data is partitioned into two sets of executions; other analyses
might partition paths or other code artifacts.) Figure 3 shows an 
example of the procedure’s input and output. 


If the splitting condition is poorly chosen, or if no implica- 
tions hold over the data, then the same properties are com- 
puted over each subset of the data, and no implications are 
reported. 

Figure 2 gives pseudocode for creation of implications 
from properties over subsets of the data, which is the third 
step of Figure 1. The CREATE-IMPLICATIONS routine is 
run twice, swapping the arguments, and then the results are 
simplified according to the rules of Boolean logic. Figures 1 
and 3 give concrete examples of the algorithm’s behavior. 
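
As a rough illustration, the CREATE-IMPLICATIONS routine of Figure 2 might be rendered in Python as follows. The sketch makes one simplifying assumption that is not in the paper: two properties are treated as mutually exclusive only when one is the syntactic negation of the other (written with a leading `!`). The helper names are hypothetical, and the final Boolean simplification step is omitted.

```python
from typing import Set, Tuple

Implication = Tuple[str, str]  # (predicate, consequent)


def negate(prop: str) -> str:
    """Syntactic negation: 'a' <-> '!a' (an assumption of this sketch)."""
    return prop[1:] if prop.startswith("!") else "!" + prop


def create_implications(s1: Set[str], s2: Set[str]) -> Set[Implication]:
    """One call of CREATE-IMPLICATIONS(S1, S2) from Figure 2."""
    out: Set[Implication] = set()
    for p1 in s1:
        # p1 must be mutually exclusive with some property of the other subset.
        if negate(p1) in s2:
            # Properties in both sets are unconditional, so they are excluded.
            for consequent in s1 - s2 - {p1}:
                out.add((p1, consequent))
    return out


def all_implications(s1: Set[str], s2: Set[str]) -> Set[Implication]:
    """Run the routine twice, swapping the arguments."""
    return create_implications(s1, s2) | create_implications(s2, s1)


# The property sets of Figure 3.
S1 = {"a", "b", "c", "d", "e"}
S2 = {"!a", "!b", "c", "f"}
for predicate, consequent in sorted(all_implications(S1, S2)):
    print(f"{predicate} => {consequent}")
```

On the Figure 3 inputs this produces the ten unsimplified implications of the figure's middle portion; `c` never appears because it holds in both subsets.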

Each mutually exclusive property implies everything 
else true for its own subset of the data. (This is true only 
because the two subsets are mutually exhaustive. For instance,
given a mechanism that generates three data subsets
inducing property sets {a, b}, {¬a, ¬b}, {a, ¬b}, it is


Properties      Implications              Simplified
S1     S2
a      ¬a       a => b     ¬a => ¬b       a <=> b
b      ¬b       a => d     ¬a => f        a => d
c      c        a => e     ¬b => ¬a       a => e
d      f        b => a     ¬b => f        ¬a => f
e               b => d
                b => e

Figure 3. Creation of implications from properties over subsets of 
the data. The left portion of the figure shows S; and So, sets of 
properties over two subsets of the data; these subsets resulted from 
some splitting condition, which is not shown. The middle portion 
shows all implications that are output by two calls to the CREATE- 
IMPLICATIONS routine of Figure 2; c appears unconditionally, so 
does not appear in any implication. The right portion shows the 
implications after logical simplification. 


not valid to examine only the first two subsets of the data
and to conclude that a => b.) The algorithm does not report
self-implications or any universally true property as the
consequent of an implication, since the universal property
appears unconditionally. In other words, if c is universally
true, there is no sense outputting “a => c” in addition to “c”.


2.1 Splitting conditions and predicates 


The left-hand-sides of implications resulting from the 
above procedure may differ from the splitting conditions 
used to create the subsets of the data. Some splitting con- 
ditions may not be left-hand-sides, and some non-splitting- 
conditions may become left-hand-sides. 

Any properties detected in the subsets of the data, not
just splitting conditions, may appear as implication predicates;
for example, “x ≥ 0 => y = 2^x” in Figure 1. This is
advantageous when the splitting condition is not reported 
(for instance, is not expressible in the underlying analysis); 
it permits strengthening or refining the splitting condition 
into a simpler or more exact predicate; and it enables re- 
porting more implications than if predicates were limited to 
pre-specified splitting conditions. 

In practice, the splitting condition usually appears as a
left-hand-side, because it is guaranteed to be true of one subset
of the data (and likewise for its negation). However, there 
are three reasons that the splitting condition (or its negation) 
might not be reported in a subset of the data. 


Figure 4. Producing splitting conditions via iterated analysis. The
final splitting conditions make no mention of augmentations, if
any, to the data traces. The final splitting conditions are used as
shown in Figure 6.


1. The splitting condition may be inexpressible in the
analysis tool’s output grammar. For example, the
Daikon invariant detector (Section 2.3), which we
used in our experimental evaluation, allows as a
splitting condition any Java boolean expression,
including program-specific declarations and calls to
libraries. The Daikon implementation permits
inexpressible splitting conditions to be passed through as
implication predicates. However, those inexpressible
invariants are usually beyond the capabilities of other
tools such as static checkers (see Section 4.1), so we
omit them in our experiments.

2. A stronger condition may be detected; the weaker, im- 
plied property need not be reported. 

3. The splitting condition may not be statistically justi- 
fied. A dynamic or stochastic analysis may use statis- 
tical tests to avoid overfitting based on too little input 
data. This can occur, for example, if one of the subsets 
is very small. Such statistical suppression is relatively 
infrequent, prevents many false positives, and rarely 
reduces the usefulness of the result [ECGN00].


2.2 Refining splitting conditions 


A splitting policy may propose a good but imperfect par- 
tition of the data that does not precisely match true differ- 
ences of behavior. Or, a splitting policy may use a (possibly 
costly) external analysis or other information that is not typ- 
ically available in the program trace or is hard for people to 
understand. We propose a two-pass process that performs 
program analysis twice — the first time to produce a refined 
set of splitting conditions, and the second time to output im- 
plications — to correct both problems. The refined splitting 
conditions are the right-hand-sides (consequents) of the im- 
plications discovered in the first pass; see Figure 4. This 
approach has two benefits. 

First, the two-pass process produces a set of splitting 
conditions in terms of program quantities. They are human- 
readable, easing inspection and editing, and they can be 
reused during other analysis steps. 

Second, the first program analysis pass helps to refine the 
initial splitting conditions. Statistical or inexact techniques 
may not partition the data exactly as desired. However, as 
long as at least one subset induces one of the desired proper- 
ties, the first program analysis pass can leverage this into the 
desired splitting condition. If the original splitting condition 
produces the desired grouping, then the additional pass does 
no harm. 

As a simple example, consider Figure 5. Suppose that
the triangle points have properties numSides = 3 and x < 0, and
the square points have properties numSides = 4 and x > 0. The
initial splitting condition (displayed as two subsets) nearly,
but not exactly, matches the true separation between behaviors.
The first pass, using the subsets as the splitting condition,
would produce “subset = 1 => numSides = 3”
and “subset = 1 => x < 0”. The refined splitting conditions
are numSides = 3 and x < 0. The second program
analysis pass yields the desired properties: “(x > 0) =>
(numSides = 4)” and “(x < 0) => (numSides = 3)”.
The clustering policy (Section 3.3) uses this two-pass strategy
directly, and the random selection policy (Section 3.4)
relies on a similar refinement of an imperfect initial data
subsetting.


Figure 5. Refinement of initial splits via an extra step of program
analysis. The initial subsets approximate, but do not exactly
match, the natural division in the data (between triangles and
squares, at x = 0). An extra program analysis step produces the
desired splitting condition.
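
The two-pass refinement of the triangle/square example can be sketched as follows. The point coordinates, the `TEMPLATES` table, and the helper names are invented for illustration; the sketch only mirrors the idea that properties found in an imperfect initial subset become the splitting conditions of the second pass.

```python
from typing import Callable, Dict, List, Set, Tuple

Point = Tuple[int, int, int]  # (x, numSides, initial subset label)

# Hypothetical property templates for this example.
TEMPLATES: Dict[str, Callable[[Point], bool]] = {
    "numSides = 3": lambda p: p[1] == 3,
    "numSides = 4": lambda p: p[1] == 4,
    "x < 0": lambda p: p[0] < 0,
    "x > 0": lambda p: p[0] > 0,
}


def detect(points: List[Point]) -> Set[str]:
    """Return the templates that hold on every point (none if empty)."""
    if not points:
        return set()
    return {n for n, holds in TEMPLATES.items() if all(holds(p) for p in points)}


# Invented data: the triangle at x = -1 was wrongly placed in subset 2.
POINTS: List[Point] = [(-3, 3, 1), (-2, 3, 1), (-1, 3, 2),
                       (1, 4, 2), (2, 4, 2), (3, 4, 2)]


def refine(points: List[Point]) -> Set[str]:
    """First pass: properties of some initial subset that are not universal."""
    universal = detect(points)
    refined: Set[str] = set()
    for label in {p[2] for p in points}:
        refined |= detect([p for p in points if p[2] == label]) - universal
    return refined


def second_pass(points: List[Point], conditions: Set[str]):
    """Second pass: split on each refined condition and detect on each side."""
    result = {}
    for name in conditions:
        cond = TEMPLATES[name]
        yes = [p for p in points if cond(p)]
        no = [p for p in points if not cond(p)]
        result[name] = (detect(yes) - {name}, detect(no))
    return result


refined = refine(POINTS)
print(sorted(refined))
print(second_pass(POINTS, refined))
```

Even though subset 2 is mixed and yields no properties of its own, subset 1 is pure enough to recover `numSides = 3` and `x < 0`, and splitting on either one separates the triangles from the squares exactly.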


2.3 Background: Dynamic invariant detection


This section briefly describes dynamic detection of likely 
program invariants [Ern00, ECGN01], which we use in our
experimental evaluation of predicate selection. The tech- 
niques of this paper should be applicable to other static 
and dynamic program analyses as well. The experiments 
in this paper use the Daikon implementation, which reports 
representation invariants and procedure preconditions and 
postconditions. The techniques of this paper for producing 
implications are publicly available in the Daikon distribution
(http://pag.lcs.mit.edu/daikon) and have
been successfully used for several years. 

We only briefly explain dynamic detection of likely
invariants, enough to appreciate the experiments;
full details may be found elsewhere [ECGN00, Ern00,
ECGN01].

Dynamic invariant detection discovers likely invariants
from program executions by instrumenting¹ the target program
to trace the variables of interest, running the instrumented
program over a test suite, and inferring invariants
over the instrumented values (Figure 6).

¹ The instrumentation may be over source code, over object code, or
performed by a run-time system. For example, of the six Java front ends
for Daikon of which we are aware, two fall into each category. The
experiments reported in this paper used source code instrumentation, and
required no modification or enhancement of the instrumenter.


Figure 6. Architecture of the Daikon tool for dynamic invariant
detection. The “splitting conditions” input is optional and enables
detection of implications of the form “a => b” (see Section 2).
This paper proposes and evaluates techniques for selecting
splitting conditions.


The inference step creates many potential invariants
(essentially, all instantiations of a set of templates), then tests
each one against the variable values captured at the
instrumentation points. A potential invariant is checked by
examining each sample (i.e., tuple of values for the variables
being tested) in turn. As soon as a sample not satisfying the
invariant is encountered, that invariant is known not to hold
and is not checked for any subsequent samples. Because
false invariants tend to be falsified quickly, the cost of
detecting invariants tends to be proportional to the number of
invariants discovered. All the invariants are inexpensive to
test and do not require full-fledged theorem-proving.

The invariant templates include about three dozen
properties over scalar variables (e.g., x < y, z = ax + by + c)
and collections (e.g., mylist is sorted, x ∈ myset).
Statistical and other techniques further improve the system’s
performance. As with other dynamic approaches such as
testing and profiling, the accuracy of the inferred invariants
depends in part on the quality and completeness of the test
cases.


3 Policies for selecting predicates


We have reduced the problem of detecting predicates to
that of selecting (approximate) splitting conditions. This
section describes four policies for detecting splitting
conditions that we experimentally evaluated: procedure return
analysis, code conditionals, clustering, and random
selection.


3.1 Procedure return analysis


This section describes a simple splitting policy based on
two dynamic checks of procedure returns. The first check
splits data based on the return site. If a procedure has
multiple return statements, then it is likely that they exhibit
different behaviors: one may be a normal case and the other
may be an exceptional case, a fast-path computation, a base
case, or different in some other manner. The second check
splits data based on boolean return values, separating cases
for which a procedure returns true from those for which it
returns false.
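
The two checks above amount to grouping procedure-exit samples by a key. A minimal sketch follows, assuming a hypothetical `ExitSample` record whose `return_site` field holds the source line of the executed return statement; neither the record nor the helper names come from Daikon.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ExitSample:
    """One observed procedure exit (hypothetical trace record)."""
    return_site: int        # assumed: line number of the return statement
    return_value: object    # value returned by the procedure
    variables: Tuple = ()   # snapshot of the variables in scope


def split_by_return_site(samples: List[ExitSample]) -> Dict[int, List[ExitSample]]:
    """First check: one subset per return statement that was executed."""
    groups: Dict[int, List[ExitSample]] = defaultdict(list)
    for sample in samples:
        groups[sample.return_site].append(sample)
    return dict(groups)


def split_by_boolean_return(samples: List[ExitSample]) -> Dict[bool, List[ExitSample]]:
    """Second check: separate exits returning true from those returning false.

    Samples whose return value is not a boolean are ignored.
    """
    groups: Dict[bool, List[ExitSample]] = {True: [], False: []}
    for sample in samples:
        if isinstance(sample.return_value, bool):
            groups[sample.return_value].append(sample)
    return groups


trace = [
    ExitSample(12, True, (0,)),
    ExitSample(12, True, (1,)),
    ExitSample(20, False, (5,)),
    ExitSample(30, 42, (7,)),
]
by_site = split_by_return_site(trace)
by_value = split_by_boolean_return(trace)
print({site: len(group) for site, group in by_site.items()})
print({value: len(group) for value, group in by_value.items()})
```

Each resulting group would then be handed to the base analysis separately, exactly as for any other splitting condition.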


3.2 Static analysis for code conditionals 


The code conditional policy is a simple static analysis 
that selects each boolean condition used in the program (as 
the test of an if, while, or for statement, or as the body
of a pure boolean member function) as a splitting condition. 

The rationale for this approach is that if the programmer 
considered a condition worth testing, then it is likely to be 
relevant to the problem domain. Furthermore, if a test can 
affect the implementation, then that condition may also af- 
fect the externally visible behavior. 

The Daikon implementation permits splitting conditions 
to be associated with a single program point (such as a pro- 
cedure entry or exit) or to be used at all program points that 
contain variables of the same name and type. Our experi- 
ments use the latter option. For instance, a condition might 
always be relevant to the program’s state, but might only be 
statically checked in one routine. Other splitting policies 
also benefit from such cross-fertilization of program points. 


3.3 Clustering 


Cluster analysis, or clustering [JMF99], is a multivariate 
analysis technique that creates groups, or clusters, of self- 
similar datapoints. Clustering aims to partition datapoints 
into clusters that are internally homogeneous (members of 
the same group are similar to one another) and externally 
heterogeneous (members of different groups are different 
from one another). Data splitting shares these same goals. 

As described in Section 2.2 and illustrated in Figure 4, 
the clustering policy uses a two-pass algorithm to refine its 
inherently approximate results. 

Clustering operates on points in an n-dimensional space. 
Each point is a single program point execution (such as a 
procedure entry), and each dimension represents a scalar 
variable in scope at that program point. We applied cluster- 
ing to each program point individually. Before performing 
clustering, we normalized the data so that each dimension 
has a mean of 0 and a standard deviation of 1. This ensures 
that large differences in some attributes (such as hash codes 
or floating-point values) do not swamp smaller differences 
in other attributes (such as booleans). 
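
The normalization step can be sketched as a per-dimension z-score. The paper does not say whether the population or the sample standard deviation was used, nor how constant dimensions were handled; the sketch below assumes the population form and maps a constant dimension to 0.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def normalize(points: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Rescale each dimension to mean 0 and standard deviation 1.

    Assumptions of this sketch: population standard deviation is used,
    and a dimension with zero variance is mapped to 0.0.
    """
    columns = list(zip(*points))
    means = [mean(column) for column in columns]
    deviations = [pstdev(column) for column in columns]
    return [
        tuple((value - m) / sd if sd > 0 else 0.0
              for value, m, sd in zip(point, means, deviations))
        for point in points
    ]


# A hash-code-like dimension next to a boolean-like dimension.
raw = [(1000003.0, 0.0), (2000017.0, 1.0), (3000001.0, 1.0), (4000029.0, 0.0)]
scaled = normalize(raw)
for point in scaled:
    print(point)
```

After rescaling, both dimensions span a comparable range, so a distance-based clustering algorithm no longer sees the large-valued dimension as dominant.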

The experiments reported in this paper use x-means clustering
[PM00], which automatically selects an appropriate
number of clusters. We repeated the experiments with k-means
and hierarchical clustering techniques; the results
differed little [Dod02].


Figure 7. The randomized algorithm for choosing splitting conditions.
The technique outputs each invariant that is detected over
a randomly-chosen subset of the data, but is not detected over the
whole data. The “detect invariants” steps are non-conditional
invariant detection.


Figure 8. Likelihood of finding an arbitrary split via random
selection. s is the size of each randomly-chosen subset, and r is the
number of such subsets.


3.4 Random selection 


Figure 7 shows a randomized analysis for selecting split- 
ting conditions. First, select r different subsets of the data, 
each of size s, and perform program analysis over each sub- 
set. Then, use any property detected in one of the subsets, 
but not detected in the full data, as a splitting condition. 
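
The randomized selection just described might look like the following sketch. The `TEMPLATES` table and `detect` helper stand in for the base analysis and are assumptions of the example, as is the fixed random seed; the defaults r = 20 and s = 10 follow the values the paper reports using.

```python
import random
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for the base (non-conditional) analysis.
TEMPLATES: Dict[str, Callable[[int], bool]] = {
    "x > 0": lambda x: x > 0,      # holds on all of the data below
    "x != 7": lambda x: x != 7,    # holds on all but one sample
    "x even": lambda x: x % 2 == 0,
}


def detect(samples: List[int]) -> Set[str]:
    """Return the templates that hold on every sample (none if empty)."""
    if not samples:
        return set()
    return {n for n, holds in TEMPLATES.items() if all(holds(x) for x in samples)}


def random_splitting_conditions(samples: List[int], r: int = 20, s: int = 10,
                                seed: int = 0) -> Set[str]:
    """Properties found in some random size-s subset but not in the full data."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    unconditional = detect(samples)
    conditions: Set[str] = set()
    for _ in range(r):
        subset = rng.sample(samples, min(s, len(samples)))
        conditions |= detect(subset) - unconditional
    return conditions


DATA = list(range(1, 201))
conditions = random_splitting_conditions(DATA)
print(sorted(conditions))
```

The nearly-universal property `x != 7` is falsified over the full data but holds in most small subsets, so it is very likely to be proposed as a splitting condition; the truly universal `x > 0` never is.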

As with the clustering technique, the randomly-selected 
subsets of the data need not perfectly separate the data. Sup- 
pose that some property holds in a fraction f < 1 of the 
data. It will never be detected by non-conditional program 
analysis. However, if one of the randomly-selected subsets 
of the data happens to contain only datapoints where the 
property holds, then the condition will be detected (and re- 
detected when the splitting conditions are used in program 
analysis). 

Figure 8 shows how likely a property is to be detected by
this technique, for several values of s and r. The property
holds in all s elements of some subset with probability
$p_1 = f^s$. Thus, the property is detected with probability $p_1$ on
each trial for a given subset of size s. The negation holds
with probability $p_2 = (1 - f)^s$. The property or its negation
holds on at least one of the subsets with probability
$1 - (1 - p_1)^r + 1 - (1 - p_2)^r$, which is graphed in Figure 8.
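
The quantity graphed in Figure 8 is straightforward to evaluate directly. The function below is a small sketch that transcribes the formula as stated in the text; the function name and the sample values of f are chosen only for illustration.

```python
def detection_likelihood(f: float, s: int, r: int) -> float:
    """Evaluate 1 - (1 - p1)^r + 1 - (1 - p2)^r for the given f, s, and r."""
    p1 = f ** s            # property holds on every element of one subset
    p2 = (1.0 - f) ** s    # negation holds on every element of one subset
    return (1.0 - (1.0 - p1) ** r) + (1.0 - (1.0 - p2) ** r)


# Compare unbalanced data (f = 0.9) with balanced data (f = 0.5),
# using subset size s = 10 and r = 20 subsets.
for fraction in (0.9, 0.5):
    print(fraction, detection_likelihood(fraction, s=10, r=20))
```

With s = 10 and r = 20, the value is close to 1 for f = 0.9 but only a few percent for f = 0.5, which matches the observation that the technique favors unbalanced data.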

The random selection technique is most effective for unbalanced
data; when f is near .5, it is likely that both an example
and a counterexample appear in each subset of size
s. We believe that many interesting properties of data are
at least moderately unbalanced. For example, the base case 
appears infrequently for data structures such as linked lists; 
unusual conditions or special-case code paths tend to be ex- 
ecuted only occasionally; and the errors that are most diffi- 
cult to identify, reproduce, and track down manifest them- 
selves only rarely. 

The likelihood of detecting a (conditional) property can 
be improved by increasing r or by reducing s. The danger 
of increasing r is that work increases linearly with r. The
danger of reducing s is that the smaller the subset, the more 
likely that any resulting properties overfit the small sample. 
(By contrast, increasing s makes it less likely that a prop- 
erty holds over an entire size-s subset.) We chose s = 10 
and r = 20 for our experiments; informal experimentation 
and graphs such as Figure 8 had suggested that these values 
were reasonable. We did not attempt to optimize them, so 
other values may perform even better, or may be better in 
specific domains. 

An example illustrates the efficacy of this technique. Our 
first experiment with random splitting applied it to the well- 
known water jug problem. Given two water jugs, one hold- 
ing (say) exactly 3 gallons and the other holding (say) ex- 
actly 5 gallons, and neither of which has any calibrations, 
how can you fill, empty, and pour water from one jug to the 
other in order to leave exactly 4 gallons in one of the jugs? 
We hoped to obtain properties about the insolubility of the 
problem when the two jugs have sizes that are not relatively 
prime. In addition, we learned that minimal-length solu- 
tions have either one step (the goal size is the size of one of 
the jugs) or an even number of steps (odd-numbered steps 
fill or empty a jug, and even-numbered steps pour as much 
of a jug as possible into the other jug). We were not aware 
of this non-obvious property before using random splitting. 


4 Evaluation 


We evaluated Section 3’s four policies for selecting split- 
ting conditions — and thus, for computing implications — 
in two different ways. The first experimental evaluation 
measured the accuracy of program analysis results for a pro- 
gram verification task (Section 4.1). The second experimen- 
tal evaluation measured how well the implications indicated 
faulty behavior (Section 4.2). 

Both tasks are instances of information retrieval [Sal68,
vR79], so we compute the standard precision and recall
measures. Suppose that we have a goal set of results and
a reported set of results; then the matching set is the
intersection of the goal and reported sets. Precision, a measure
of correctness, is defined as
$\frac{|\mathit{matching}|}{|\mathit{reported}|}$. Recall, a measure
of completeness, is defined as
$\frac{|\mathit{matching}|}{|\mathit{goal}|}$. Both measures are
always between 0 and 1, inclusive.

Implications that are not in the matching set are still correct
statements about the program’s conditional behavior;
these properties simply do not relate to the goal set. Each
task induces a different goal set, to which different properties
are relevant, so some imprecision is inevitable.
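
The precision and recall definitions above translate directly into set operations. In this sketch the property strings are placeholders, and returning 1.0 for an empty reported or goal set is a convention assumed here, not something the paper specifies.

```python
from typing import Set, Tuple


def precision_recall(goal: Set[str], reported: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a reported set against a goal set."""
    matching = goal & reported
    # Assumed convention: an empty denominator yields 1.0.
    precision = len(matching) / len(reported) if reported else 1.0
    recall = len(matching) / len(goal) if goal else 1.0
    return precision, recall


goal = {"p1", "p2", "p3", "p4"}      # placeholder goal properties
reported = {"p1", "p2", "q1"}        # placeholder reported properties
precision, recall = precision_recall(goal, reported)
print(precision, recall)
print("missing fraction of goal:", 1.0 - recall)
```

The complement of recall, printed last, is the fraction of goal properties that were not reported.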


4.1 Static checking 


The static checking experiment measures how much a 
programmer’s task in verifying a program is eased by the 
availability of conditional properties. Adding implications 
reduced human effort by 40% on average, and in some cases 
eliminated it entirely. 

The programmer task is to change a given (automati- 
cally generated) set of program properties so that it is self- 
consistent and guarantees lack of null pointer dereferences, 
array bounds overruns, and type cast errors. The amount of 
change is a measure of human effort. 

The experiment uses the ESC/Java static checker 
[FLL+02] to verify lack of runtime errors. ESC/Java issues
warnings about potential run-time errors and about annota- 
tions that cannot be verified. Like the Houdini annotation 
assistant [FL01], Daikon can automatically insert its output
into programs in the form of ESC/Java annotations, which 
are similar in flavor to assert statements. 

Daikon’s output may not be completely verifiable by 
ESC/Java. Verification may require removal of certain an- 
notations that are not verifiable, either because they are not 
universally true or because they are beyond the checker’s 
capabilities. Verification may also require addition of miss- 
ing annotations, when those missing annotations are neces- 
sary for the correctness proof or for verification of other 
necessary annotations. People find eliminating undesir- 
able annotations easy but adding new ones hard (see Sec- 
tion 4.1.1). Therefore, the complement of recall — the num- 
ber of missing properties —is the best measure of how 
much work a human would have to perform in order to ver- 
ify the lack of run-time errors in the code. 

We analyzed the Java programs listed in Figure 9. 
DisjSets, StackAr, and QueueAr come from a data 
structures textbook [Wei99]; Vector is part of the Java 
standard library; and the remaining programs are solutions 
to assignments in a programming course at MIT. Each pro- 
gram verification attempt included client code (not counted 
in the size measures of Figure 9) to ensure that the verified 
properties also satisfied their intended specification. For 
each program and each set of initial annotations, we chose 
a goal set by hand [NE02a]. There is no unique verifiable 


                  Program size | No implications | Return          | Static          | Cluster         | Random
Program            LOC   NCNB  | Prec.   Recall  | Prec.   Recall  | Prec.   Recall  | Prec.   Recall  | Prec.   Recall
FixedSizeSet        76     28  | 1.00    0.86    | 1.00    0.86    | 1.00    0.86    | 1.00    0.86    | 1.00    0.86
DisjSets           715     29  | 0.82    1.00    | 1.00    0.97    | 1.00    1.00    | 1.00    0.94    | 0.80    0.98
StackAr            114     50  | 1.00    0.90    | 1.00    1.00    | 0.95    1.00    | 0.78    1.00    | 0.95    1.00
QueueAr            116     56  | 0.92    0.71    | 0.98    0.78    | 0.89    0.84    | 0.62    0.89    | 0.77    0.91
Graph              180     99  | 0.80    1.00    | 0.80    1.00    | 0.80    1.00    | 0.80    1.00    | 0.80    1.00
GeoSegment         269    116  | 1.00    1.00    | 1.00    1.00    | 1.00    1.00    | 0.94    1.00    | 0.73    1.00
RatNum             276    139  | 0.93    1.00    | 0.91    1.00    | 1.00    1.00    | 0.72    1.00    | 0.50    1.00
StreetNumberSet    303    201  | 0.82    0.95    | 0.77    0.95    | 0.77    0.96    | 0.77    0.96    | 0.83    0.89
Vector             536    202  | 0.96    0.95    | 0.99    0.95    | 0.76    0.98    | 0.71    0.97    | 0.81    0.97
RatPoly            853    498  | 0.81    0.97    | 0.67    0.95    | 0.71    0.96    | 0.68    0.96    | 0.79    0.95
Total             4886   2451  | 0.91    0.93    | 0.91    0.95    | 0.89    0.96    | 0.80    0.96    | 0.80    0.96
Missing                        | 0.09    0.07    | 0.09    0.05    | 0.11    0.04    | 0.20    0.04    | 0.20    0.04


Figure 9. Invariants detected by Daikon and verified by ESC/Java, using four policies for selecting splitting conditions (or using none). 
“LOC” is the total lines of code. “NCNB” is the non-comment, non-blank lines of code. “Prec” is the precision of the reported invariants, 
the ratio of verifiable to verifiable plus unverifiable invariants. “Recall” is the recall of the reported invariants, the ratio of verifiable to 
verifiable plus missing. “Missing” indicates the overall missing precision or recall. The most important measure is the missing recall (in 
bold): it is the most accurate measure of human effort, since it indicates how much humans must add to the reported set. By comparison, 
removing elements from the set — measured by the complement of precision — is an easy task. 


set of annotations, so we chose a verifiable set that we be- 
lieved to be closest to the initial annotations (that is, the 
analysis output). This gives (an upper bound on) the mini- 
mum amount of work a programmer must do, and is a good 
approximation of the actual work a programmer would do. 

Figure 9 gives the experimental results. Return value 
analysis produced the fewest implications, followed by code 
conditionals, clustering, and random splitting. 

As indicated in the “No implications” column of Fig- 
ure 9, even when supplied with no splitting conditions, dy- 
namic invariant detection performs well at this task. (All re- 
sults were obtained without adding or removing invariants 
or otherwise tuning Daikon to the particular set of programs 
or to the ESC/Java checker; however, the results suggest 
that Daikon is well-matched to ESC/Java’s strengths.) 91% 
of the reported properties are verifiable; the other 9% are 
true, but their verification either depends on missing prop- 
erties or is beyond the capabilities of ESC/Java. Further- 
more, 93% of all properties necessary for verification are 
already present; for 4 of the programs, no properties at all 
need to be added. Therefore, there is little room for im- 
provement. Nonetheless, adding implications reduced the 
fraction of missing properties by 40% on average (from .07 
to .04), and by up to 100%. 
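
As a worked restatement of the measures in Figure 9 and of the
reduction quoted above: the symbols V, U, and M are shorthand
introduced here (not notation from the figure) for the numbers of
verifiable, unverifiable, and missing invariants.

```latex
\[
\mathrm{Prec} = \frac{V}{V + U}, \qquad
\mathrm{Recall} = \frac{V}{V + M}, \qquad
1 - \mathrm{Recall} = \frac{M}{V + M}
\]
\[
\frac{0.07 - 0.04}{0.07} \approx 0.43
\]
```

Computed from the rounded totals, the relative drop in missing
recall is about 43%, in line with the roughly 40% stated in the
text; the exact value depends on unrounded figures that the table
does not show.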

As a specific example, consider the QueueAr pro- 
gram [Wei99]. The data structure is an array-based Java 
implementation of a queue. Its fields are: 


Object[] theArray;
int front;        // index of head element
int back;         // index of tail element
int currentSize;  // number of valid elements


After calling enough enqueue commands on a QueueAr
object, the back pointer wraps around to the front of the
array. The front pointer does the same after enough dequeue
commands.
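
The wrap-around behavior can be illustrated with a minimal sketch.
Only the four field names come from the text; the class name
QueueArSketch, the increment helper, the capacity parameter, the
exceptions, and the empty-queue convention back = front - 1 are
assumptions made for illustration, not the textbook's code.

```java
// Hypothetical sketch of an array-based circular queue (not the textbook source).
class QueueArSketch {
    Object[] theArray;
    int front;        // index of head element
    int back;         // index of tail element
    int currentSize;  // number of valid elements

    QueueArSketch(int capacity) {
        theArray = new Object[capacity];
        front = 0;
        back = -1;        // assumed convention: an empty queue has back == front - 1
        currentSize = 0;
    }

    // Advance an index by one slot, wrapping to 0 past the end of the array.
    private int increment(int x) {
        return (x + 1 == theArray.length) ? 0 : x + 1;
    }

    void enqueue(Object x) {
        if (currentSize == theArray.length) {
            throw new IllegalStateException("queue full");
        }
        back = increment(back);
        theArray[back] = x;
        currentSize++;
    }

    Object dequeue() {
        if (currentSize == 0) {
            throw new IllegalStateException("queue empty");
        }
        Object item = theArray[front];
        theArray[front] = null;    // clear the vacated slot
        front = increment(front);
        currentSize--;
        return item;
    }

    public static void main(String[] args) {
        QueueArSketch q = new QueueArSketch(3);
        q.enqueue("a");
        q.enqueue("b");
        q.enqueue("c");
        q.dequeue();
        q.enqueue("d");   // back wraps around to the start of the array
        System.out.println("front=" + q.front + " back=" + q.back);
    }
}
```

After the final enqueue in main, back is smaller than front, which
is the wrapped representation that the implications in this section
must describe.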

In a user study [NE02b], no users succeeded in writing 
correct QueueAr object invariants in one hour, despite be- 
ing given a textbook’s description of the implementation, 
complete with figures [Wei99]. The most troublesome an- 
notations for users were implications, suggesting that ma- 
chine help is appropriate for suggesting them. 

As indicated in Figure 9, invariant detection without im- 
plications found 71% of the necessary annotations. The 
missing annotations included the following, among other 
similar properties. (For brevity, let size = currentSize 
and len = theArray.length.) 


Properties when the queue is empty 
For example, 

    ((size = 0) ∧ (front > back)) ⇒ (back = front − 1)

    (size = 0) ⇒ ∀ i, 0 ≤ i < len: (theArray[i] = null)

The actual ESC/Java annotation inserted by Daikon in 
the second case is as follows; for brevity, we will gen- 
erally present logical formulae instead. 


/*@ invariant 
(currentSize == 0) 
==> (\forall int i; 
(0 <= i && i <= theArray.length-1) 
==> (theArray[i] == null)); */ 


Properties when the concrete rep is wrapped 
For example, 

    ((size > 0) ∧ (front < back)) ⇒ (size = back − front + 1)

    ((size > 0) ∧ (front > back)) ⇒ (size = len + back − front + 1)


Properties over valid/invalid elements 
These refer to the array locations logically between the
front and back indices, or between the back and front
indices. For example, 

    ((size > 0) ∧ (front < back)) ⇒ ∀ i, 0 ≤ i < front: (theArray[i] = null)

    ((size > 0) ∧ (front < back)) ⇒ ∀ i, back < i < len: (theArray[i] = null)

The random and clustering policies found properties 
from all three categories. By contrast, the code conditionals 
policy only found the properties in the first category. See 
Section 4.3. 
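
The example implications can also be read as executable checks
over a queue state. The following is a hedged sketch: ESC/Java
verifies such properties statically, whereas this code only tests
one concrete state. The class and method names (QueueArInvariants,
allNull, implies, holds) are hypothetical, the quantifier bounds
follow the properties as listed above, and the state is assumed to
satisfy 0 <= front < len and -1 <= back < len.

```java
// Hypothetical runtime check of the example conditional properties.
class QueueArInvariants {
    // True iff every slot with index in [lo, hi) is null.
    static boolean allNull(Object[] a, int lo, int hi) {
        for (int i = lo; i < hi; i++) {
            if (a[i] != null) {
                return false;
            }
        }
        return true;
    }

    // Logical implication p => q.
    static boolean implies(boolean p, boolean q) {
        return !p || q;
    }

    // Assumes 0 <= front < len and -1 <= back < len, where len = theArray.length.
    static boolean holds(Object[] theArray, int front, int back, int size) {
        int len = theArray.length;
        // Properties when the queue is empty.
        boolean empty =
            implies(size == 0 && front > back, back == front - 1)
            && implies(size == 0, allNull(theArray, 0, len));
        // Properties relating size to the indices, unwrapped and wrapped.
        boolean sizes =
            implies(size > 0 && front < back, size == back - front + 1)
            && implies(size > 0 && front > back, size == len + back - front + 1);
        // Properties over the invalid (unused) elements in the unwrapped case.
        boolean unused =
            implies(size > 0 && front < back, allNull(theArray, 0, front))
            && implies(size > 0 && front < back, allNull(theArray, back + 1, len));
        return empty && sizes && unused;
    }
}
```

Each conjunct mirrors one implication, so a state that violates
exactly one property makes holds return false.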


4.1.1 Recall vs. precision 


Adding implications to the set of properties presented to the 
static verifier ESC/Java improved the recall of the properties 
(as measured against a verifiable set) but reduced precision. 
The drop in precision is due to over-fitting: partitioning the 
data leads to more false positives in each subset, particularly 
since the test suites were quite small [NE02a]. 

There are two reasons that reduced precision is not a 
problem in practice. First, recall is more important to peo- 
ple performing the program verification task. Users can eas- 
ily recognize and eliminate undesirable properties [NE02b], 
but they have more trouble producing annotations from 
scratch — particularly implications, which tend to be the 
most difficult invariants for users to write. Therefore, 
adding properties may be worthwhile even if precision de- 
creases. Two of the implications that decreased the cluster- 
ing technique’s precision on the QueueAr program are: 


    (size > 2) ⇒ (theArray[back − 1] ≠ null)

    (front > back) ⇒ (size ≠ back)


Humans with modest familiarity with the QueueAr imple- 
mentation quickly skipped over these as candidates for aid- 
ing the ESC/Java verification. 

There is a second reason that lowered precision due to
implications is not a concern: a human can easily augment
the test suite to improve the precision. We augmented three
of the test suites, taking less than one hour for each (and less
than the verification time). The invariant detector and static
verifier output, which indicated which properties had been
induced from too little data (such as the ones immediately
above), made it obvious how to improve the test suites.
Figure 10 shows the accuracy (with respect to the program
verification task) of cluster-based splitting with both the
original and the augmented test suites. Augmentation increased
precision from .71 to .93 and, unexpectedly, increased recall
from .96 to .98. The final recall leaves only 2% of the
annotations missing, down from 13% when no implications
were present: users need to add less than one sixth as many
annotations. Thus, results in practice (with more reasonable
test suites, or with modest human effort to improve poor ones)
may be even better than indicated by Figure 9.


           No implications  Cluster          Augmented
Program    Prec.  Recall    Prec.  Recall    Prec.  Recall
StackAr     1.00    0.90     0.78    1.00     1.00    1.00
QueueAr     0.92    0.71     0.62    0.89     0.91    0.93
RatNum      0.93    1.00     0.72    1.00     0.88    1.00
Total       0.95    0.87     0.71    0.96     0.93    0.98
Missing     0.05    0.13     0.29    0.04     0.07    0.02


Figure 10. Augmenting test suites to improve precision for static 
verification. The table layout is as in Figure 9. 


4.2 Error detection 


Our experiment with error detection evaluates implica- 
tions based on a methodology for helping to locate program 
errors. Errors induce different program behaviors; that is, a 
program behaves differently on erroneous runs than on cor- 
rect runs. One such difference is that the erroneous run may 
exhibit a fault (a deviation from desired behavior). Even if 
no fault occurs, the error affects the program’s data struc- 
tures or control flow. Our goal is to capture those differ- 
ences and present them to a user. As also observed by 
other authors [DDLE02, HL02, RKS02, GV03, PRKR03], 
the differences may lead programmers to the underlying
errors. This is true even if users do not initially know which 
of two behaviors is the correct one and which is erroneous: 
that distinction is easy to make. 

There are two different scenarios in which our tool might 
help to locate an error [DDLE02]. 

Scenario 1. The user knows errors are present, has a 
test suite, and knows which test cases are fault-revealing. 
A dynamic program analysis can produce properties using,
as a splitting condition, whether a test case is fault-revealing.
The resulting conditional properties capture the 
differences between faulty and non-faulty runs and explicate
what data structures or variable values underlie the 
faulty behavior. The analysis’s generalization over multiple 
faulty runs spares the user from being distracted by specifics 
of any one test case and from personally examining many 
test cases. 


Figure 11. Evaluation of predicates for separating faulty and non- 
faulty runs. The goal set is the properties that discriminate between 
faulty and non-faulty runs. The reported set is the consequents of 
conjectured implications. The intersection between the goal and 
reported sets is the matching set. 

Scenario 2. The user knows errors exist but cannot re- 
produce and/or detect faults (and perhaps does not even 
know which test cases might yield faults). Alternately, the 
user does not know whether errors exist but wishes to be 
apprised of evidence of potential errors, for instance to
further investigate critical code or code that has been flagged 
as suspicious by a human or another tool. 

In scenario 2, we propose to present the user with a set 
of automatically-generated implications that result from dif- 
fering behaviors in the target program. We speculate that, 
if there are errors in the program, then some of the impli- 
cations will reveal the differing behavior, perhaps leading 
the user to the error. (Other implications will be irrele- 
vant to any errors, even if they are true and useful for other 
tasks.) Anecdotal results support the speculation. In many 
cases, after examining the invariants but before looking at 
the code, we were able to correctly guess the errors. It might 
be interesting to test the speculation via a user study. Such a 
study is beyond the scope of this paper, and it is not germane 
to this section’s main point of testing whether our technique 
is able to identify noteworthy differences in behavior. There 
are many potential uses for conditional properties other than 
debugging. 


Our evaluation focuses on scenario 2, because existing 
solutions for it are less satisfactory than those for scenario 1. 

Figure 11 diagrams our evaluation technique. The goal 
set of properties is the ones that would have been created in 
scenario 1, in which the data is partitioned based on whether 
a test case is fault-revealing. We simulate this by detecting 
properties individually on the fault-revealing and non-fault- 
revealing tests. The goal set contains all properties detected 
on one of those inputs but not on the other. 


As a simple example, one NFL program contained a re- 
cursion error, causing faulty output whenever the input was 
not in the base case. The goal properties for this program 
distinguished between the base case and the recursive case 
(and the faulty program also had fewer invariants overall 
than correct ones did [ECGN01]). As a second example, 
one replace program failed to warn about illegal uses 
of the hyphen in regular expressions. On erroneous runs, 
properties over the replacement routine were different, in- 
cluding relationships between the pre- and post-values of 
indices into the pattern. 


The reported properties are those resulting from running 
the program analysis, augmented by a set of splitting con- 
ditions produced by one of the policies in Section 3. Given 
the goal and reported sets, we compute precision and recall, 
as described at the beginning of Section 4. 
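
The goal, reported, and matching sets can be computed with plain
set operations. This is a minimal sketch under stated assumptions:
properties are represented as strings, and the names
ErrorDetectionMetrics, goalSet, matching, precision, and recall are
hypothetical. Returning 0.0 for an empty denominator is a
convention chosen here, not one stated in the text.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for the evaluation diagrammed in Figure 11.
class ErrorDetectionMetrics {
    // Goal set: properties detected on exactly one of the two partitions
    // (the symmetric difference of the two property sets).
    static Set<String> goalSet(Set<String> onFaulty, Set<String> onNonFaulty) {
        Set<String> goal = new HashSet<>(onFaulty);
        goal.addAll(onNonFaulty);
        Set<String> common = new HashSet<>(onFaulty);
        common.retainAll(onNonFaulty);
        goal.removeAll(common);
        return goal;
    }

    // Matching set: intersection of the goal and reported sets.
    static Set<String> matching(Set<String> goal, Set<String> reported) {
        Set<String> result = new HashSet<>(goal);
        result.retainAll(reported);
        return result;
    }

    // Precision: ratio of matching to reported.
    static double precision(Set<String> goal, Set<String> reported) {
        if (reported.isEmpty()) {
            return 0.0;  // convention chosen for this sketch
        }
        return (double) matching(goal, reported).size() / reported.size();
    }

    // Recall: ratio of matching to goal.
    static double recall(Set<String> goal, Set<String> reported) {
        if (goal.isEmpty()) {
            return 0.0;  // convention chosen for this sketch
        }
        return (double) matching(goal, reported).size() / goal.size();
    }
}
```

Here the reported set stands for the consequents of the conjectured
implications, as in the caption of Figure 11.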


We evaluated our technique over eleven different sets 
of programs totaling over 137,000 lines of code. Each 
set of programs was written to the same specification. 
The NFL, Contest, and Azot programs came from 
the TopCoder programming competition website
(www.topcoder.com). These programs were submitted by 
contestants; the website published actual submissions and 
test cases. The three programs determined how football 
scores could be achieved, how elements could be distributed 
into bins, and how lines could be drawn to pass through cer- 
tain points. We selected 26 submissions at random that con- 
tained real errors made by contestants. The TopCoder test 
suites are relatively complete, because the contestants aug- 
mented them in an effort to disqualify their rivals. About 
half of the TopCoder programs contained one error, and 
the others contained multiple errors within the same func- 
tion. The RatPoly and CompostiteRoute programs 
were written by students in an undergraduate class at MIT 
(6.170 Laboratory in Software Engineering). Students im- 
plemented a datatype for polynomials with rational coef- 
ficients and a set of algorithms for modeling geographi- 
cal points and paths. We selected all student submissions 
that compiled successfully and failed at least one staff test 
case. These programs, too, contained real, naturally occur- 
ring errors. The students had a week to complete their as- 
signment, unlike the TopCoder competitors who were un- 
der time pressure. Most of the student programs contained 
multiple, distinct errors in unrelated sections of the code. 
The remaining programs were supplied by researchers from 
Siemens [HFGO94, RH98] and are commonly used in test- 
ing research. Every faulty version of the Siemens programs 
has exactly one distinct error. The errors were seeded by hu- 
mans, who chose them to be realistic. The print_tokens 
and print_tokens2 programs (and their errors) are un- 
related. 


Figure 12 summarizes the results of the error detection 
experiment. The results indicate that regardless of the
splitting policy, the technique is effective. For this experiment, 
the precision measurements are the most important. (Low 
recall measures are not a concern. Typically an error
induces many differences in behavior, and recognizing and 
understanding just one of them is sufficient.) 


                               Program size   Return         Static         Cluster        Random
Program          Source  Ver.     LOC   NCNB  Prec.  Recall  Prec.  Recall  Prec.  Recall  Prec.  Recall
NFL              TC        10      23     21   0.00    0.00   0.03    0.08   0.03    0.08   0.09    0.37
Contest          TC        10      21     17   0.00    0.00   0.19    0.40   0.11    0.23   0.15    0.21
Azot             TC         6      18     17   0.00    0.00   0.00    0.00   0.13    0.46   0.12    0.15
RatPoly          MIT       32     853    498   0.03    0.00   0.03    0.01   0.07    0.03   0.14    0.09
CompostiteRoute  MIT       67     883    319   0.22    0.09   0.22    0.09   0.21    0.47   0.21    0.45
print_tokens     S          7     703    452   0.00    0.00   0.03    0.13   0.04    0.22   0.04    0.34
print_tokens2    S         10     549    379   0.00    0.00   0.00    0.00   0.00    0.00   0.00    0.00
replace          S         32     506    456   0.24    0.31   0.14    0.28   0.10    0.42   0.15    0.37
schedule2        S          9     369    280   0.00    0.00   0.20    0.35   0.20    0.35   0.24    0.41
tcas             S         41     178    136   0.17    0.24   0.19    0.31   0.12    0.23   0.13    0.40
tot_info         S         23     556    334   0.00    0.00   0.12    0.27   0.09    0.40   0.03    0.09
Total                     247  137015  75115   0.06    0.06   0.10    0.18   0.10    0.26   0.12    0.26


Figure 12. Detection of conditional behavior induced by program errors, compared for four splitting policies. (When no splitting is in 
effect, the precision and recall are always zero.) “Ver” is the number of versions of the program. “LOC” is the average total lines of code in 
each version. “NCNB” is the average non-comment, non-blank lines of code. “Prec” is the precision of the reported invariants, the ratio of 
matching to reported. “Recall” is the recall of the reported invariants, the ratio of matching to goal. In this experiment, precision (in bold) 
is the most important measure: it indicates how many of the reported implications indicate erroneous behavior. In cases where precision is 
0.00, the experiment did report some implications, but none of them contained consequents in the goal set. 

The precision measurements show that on average 6 to
12% of the reported properties indicate a difference in
behavior between succeeding and failing runs. In other words, 
a programmer using our methodology with random sam- 
pling could expect to examine 8 reported implications be- 
fore discovering one that indicates the difference between 
correct and erroneous behavior. (The other 7 implications 
may be useful for other tasks, but not for error detection. 
The Siemens programs and some MIT student programs 
contained only one or two faults in a program averaging 
617 lines long, so there is a lot of conditional behavior in 
the programs that has nothing to do with program faults. In 
fact, it represents a significant success that precision is so 
high.) 
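
The figure of 8 implications can be reproduced with a short
calculation. As a modeling assumption introduced here (not stated
in the text), suppose each reported implication independently
reflects the error with probability p equal to the measured
precision; the number N of implications examined up to and
including the first relevant one is then geometrically distributed.

```latex
\[
E[N] = \sum_{k \ge 1} k \, (1 - p)^{k-1} \, p = \frac{1}{p},
\qquad
\frac{1}{0.12} \approx 8.3,
\qquad
\frac{1}{0.06} \approx 16.7
\]
```

Under this assumption, random selection (p = 0.12) gives about 8
implications, while the lowest average precision in Figure 12
(p = 0.06) would give about 17.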


4.3 Comparing policies 


Our experiments show that our technique for selecting 
predicates for program analysis, along with the four poli- 
cies for partitioning the data being analyzed, are success- 
ful. The resulting implications substantially eased the task 
of program verification and frequently pointed out behavior 
induced by errors. 

It is natural to ask which of the splitting policies is best: 
which one should be used in preference to the others? Un- 
fortunately, there is no clear winner: each approach is best 
in certain contexts or on certain programs, and overall the 
approaches are complementary. This is not surprising:
programming tasks, and programs themselves, differ enough 
that no approach is a panacea. (In some cases, such as er- 
ror detection in print_tokens2, no policy worked well, 
and future research is required.) We can draw some general 
conclusions, however. 


On average, the random selection policy has a slight 
edge. This technique has no built-in biases regarding what 
sort of behavioral differences may exist, so it can do well 
regardless of whether those differences have some obvious 
underlying structure. Thanks to eliminating statistically un- 
justified properties (often resulting from data sets that are 
too small), the technique does not produce an excessive 
number of false positives. On the other hand, random se- 
lection does not work well when there is a relatively equal 
split between behaviors, and random selection cannot take 
advantage of structure when it is present (e.g., from source 
code). 


Clustering is the second-best policy. It looks for struc- 
ture in a vector space composed of values from the traces 
being analyzed. Fairly often, the structure in that space cor- 
responded to real differences in behavior, either for verifi- 
able behavior or due to errors. Clustering is less effective 
when the behavioral differences do not fall into such clus- 
ters in that vector space, however, and there can be prob- 
lems with recognizing the correct number of clusters. It is 
interesting that the purely dynamic policies, which examine 
run-time values but ignore the implementation details of the 
source code, perform so well. This suggests that the struc- 
ture of useful implications sometimes differs from the sur- 
face structure of the program itself; this observation could 
have implications for static and dynamic analysis. 


The code conditional policy performs only marginally 
worse than the previous two. For the program verification 
task, it dominated the other policies, except on QueueAr, 
where it did very poorly. We speculate that the code 
conditional policy is well-matched to ESC/Java because 
ESC/Java verifies the program source code, and it often 
needs to prove conditions having to do with code paths 
through conditionals. Furthermore, whereas errors can be 
complicated and subtle, ESC/Java is (intentionally) limited 
in its capabilities and in its grammar, and is matched to the 
sort of code that (some) programmers write in practice. The 
poor showing on QueueAr is due at least in part to the 
fact that comparisons over variables front and back were 
crucial to verification, but those variables never appeared in 
the same procedure, much less in the same expression. In 
other cases, the code conditionals policy was hampered by 
the fact that ESC/Java does not permit method calls in an- 
notations. Our code conditionals policy partially worked 
around this by inlining the bodies of all one-line methods 
(i.e., of the form “return expression”), with parameters 
appropriately substituted, in each conditional expression. 

The procedure return policy performed worse than we 
anticipated. Like the code conditionals policy, it is highly 
dependent on the particular implementation chosen by the 
programmer. In some cases, this worked to its advantage; 
for instance, it outperformed all other policies at error de- 
tection for the replace program. 


5 Related work 


Clustering [JMF99] aims to partition data so as to reflect 
distinctions present in the underlying data. It is now widely 
used in software engineering as well as in other fields. As 
just one example of a use related to our clustering splitting 
policy, Podgurski et al. [PMM+99] use clustering on
execution profiles (similar to our data traces) to differentiate 
among operational executions. This can reduce the cost of 
testing. In related work, Dickinson et al. [DLP01] use
clustering to identify outliers; sampling outlier regions is
effective at detecting failures. 

Comparing behavior to look for differences has long 
been applied by working programmers; however, both this 
research and some related research have found new ways to 
apply those ideas to the domain of error detection. 

Raz et al. [RKS02] used the Daikon implementation
(albeit without most of the implication techniques discussed 
in this paper) to detect anomalies in online data sources. 
Hangal and Lam [HL02] used dynamic invariant detection 
in conjunction with a checking engine and showed that the 
techniques are effective at bug detection. Related ideas
regarding comparison of properties, but applied in a static 
context, were evaluated by Engler et al. [ECH+01], who 
detected numerous bugs in operating system code by
exploiting the same underlying idea: when behavior is
inconsistent, then a bug is present, because one of the behaviors 
must be incorrect. An automated system can flag such
inconsistencies even in the absence of a specification or other 
information that would indicate which of the behaviors is 
erroneous. 

Groce and Visser [GV03] use dynamic invariant detec- 
tion to determine the essence of counterexamples: given a 
set of counterexamples, they report the properties that are 
true over all of them. These properties (or those that are true 
over only succeeding runs) abstract away from the specific 
details of individual counterexamples. (This is scenario 1 
of Section 4.2.) 


6 Conclusion 


This paper proposes a technique for improving the qual- 
ity of program analysis. The improvement uses splitting 
conditions to partition data under analysis and then to cre- 
ate implications or conditional properties. It is computa- 
tionally infeasible to consider every possible implication. 
Splitting conditions limit the predicates that are considered, 
but predicates that are not splitting conditions may still ap- 
pear. Concretely, the experimental results show the benefits 
for dynamic detection of likely program invariants. 
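
The core idea, partitioning the data with a splitting condition and
turning properties that hold on only one side into implications,
can be sketched as follows. This is a deliberately simplified
illustration, not Daikon's implementation: it omits statistical
justification and predicate refinement, and the names
SplittingSketch and implications, as well as the string rendering
of properties, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: derive implications from one splitting condition.
class SplittingSketch {
    static <T> List<String> implications(List<T> samples,
                                         String splitName,
                                         Predicate<T> split,
                                         Map<String, Predicate<T>> candidates) {
        // Partition the samples by the splitting condition.
        List<T> yes = new ArrayList<>();
        List<T> no = new ArrayList<>();
        for (T s : samples) {
            (split.test(s) ? yes : no).add(s);
        }
        List<String> result = new ArrayList<>();
        if (yes.isEmpty() || no.isEmpty()) {
            return result;  // the condition does not actually split the data
        }
        for (Map.Entry<String, Predicate<T>> c : candidates.entrySet()) {
            boolean onYes = yes.stream().allMatch(c.getValue());
            boolean onNo = no.stream().allMatch(c.getValue());
            // A property that holds on both sides is unconditional,
            // so it is not reported as an implication.
            if (onYes && !onNo) {
                result.add(splitName + " ==> " + c.getKey());
            }
            if (onNo && !onYes) {
                result.add("!(" + splitName + ") ==> " + c.getKey());
            }
        }
        return result;
    }
}
```

A property observed on only one side of the partition becomes the
consequent of an implication whose predicate is the splitting
condition or its negation.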

The paper proposes four splitting (data partitioning) poli- 
cies that can be used in conjunction with the implication 
technique: return value analysis, simple static analysis to 
obtain code conditionals, clustering, and random selection. 
No policy dominates any other, but on average the latter two 
perform best. We provided preliminary explanations of this 
behavior. 

The paper introduces a two-pass program analysis tech- 
nique that refines inexact, statistical, or inexpressible split- 
ting conditions into ones that can be applied to arbitrary 
runs and understood by humans. This technique leverages 
a base program analysis to produce more useful predicates 
and implications. 

Two separate experimental evaluations confirm the effi- 
cacy of our techniques. First, they improve performance on 
a program verification task: they reduce the number of miss- 
ing properties, which must be devised by a human, by 40%. 
Second, we proposed a methodology for detecting differ- 
ences in behavior between faulty and non-faulty program 
runs, even when the user has not identified which runs are 
faulty and which runs are not. Most conditional behavior in 
a program results from aspects of program execution other
than errors, but on average 6 to 12% of reported properties,
depending on the splitting policy, directly reflect errors. 